ABSTRACT:


When the sizes of the training sets are small, classification
in a subspace of the original data space may give rise to a smaller
probability of error than classification in the data space itself.
This is because the gain in the accuracy of estimation of the likelihood
functions used in classification in the lower dimensional space
(subspace) offsets the loss of information associated with dimensionality
reduction (feature extraction). To test this conjecture, a
computer simulation was performed. A number of pseudo-random
training and data vectors were generated from two four-dimensional
Gaussian classes. An algorithm previously described (ICSA Technical
Report #275-025-0lC, EE Technical Report #7520) was used to create
an optimal one-dimensional feature space on which to project the
data. When the sizes of the training sets were small, classification
of the data in the optimal one-dimensional space was found to yield lower
error rates than classification in the original four-dimensional space.
Specifically, depending on the sizes of the training sets, the improvement
ranged from 11% to 1%.
I.  Introduction :

In real pattern recognition systems, the situation often arises
that the classifier as well as the feature extractor must be designed
with a limited number of training samples.* As a result, in certain
cases, the estimates of the class conditional statistics which are
used to determine the classification strategy are poor.

When Gaussian statistics are assumed and the dimension of the
raw data is n, then the n elements of the mean vector \mu_j for
class j as well as the n(n+1)/2 independent elements
of the covariance matrix \Sigma_j for class j are estimated using the
formulas:

    \hat{\mu}_j = \frac{1}{N_j} \sum_{x_i \in S_j} x_i                                          (1)

    \hat{\Sigma}_j = \frac{1}{N_j} \sum_{x_i \in S_j} (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T   (2)

where S_j is the training set representing class j and
N_j is the total number of training vectors x_i in S_j.

It is well known that the uncertainties of these estimates
decrease monotonically with increasing N_j and decreasing n [2].

The number of training samples necessary to obtain a non-singular
estimate of the covariance matrix must be greater than or equal
to n + 1. However, to obtain a really good estimate of
the covariance as well as the mean, several times this number
of training samples is often needed [2].
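The counting argument above can be made concrete with a short sketch. It tallies the quantities estimated per class under the Gaussian assumption (n mean elements plus n(n+1)/2 covariance elements) and the minimum training set size n + 1. The function names are illustrative and do not come from the report.

```python
def parameters_per_class(n):
    """Independent quantities estimated for one Gaussian class in n dimensions."""
    mean_elements = n
    covariance_elements = n * (n + 1) // 2  # symmetric matrix: one triangle only
    return mean_elements + covariance_elements


def min_training_samples(n):
    """Smallest N_j for which the covariance estimate can be non-singular."""
    return n + 1


# Compare a four-dimensional data space with a one-dimensional feature space.
for n in (4, 1):
    print(n, parameters_per_class(n), min_training_samples(n))
```

For n = 4 this gives 14 quantities per class and at least 5 training vectors, against 2 quantities per class when n = 1.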

* The  effect  of  dimensionality  versus  sample  size  in  the  estimation 
of  density  functions  has  been  considered  by  a number  of  investigators 
(see e.g. L. Kanal and B. Chandrasekaran, Pattern Recognition, 3,
225-234 [1971]). In the present report we investigate this question
empirically  in  the  context  of  one  of  the  feature  extraction  techniques 
developed  by  us. 
When the ratio N_j / n (j = 1, ..., M) (where M = total
number of classes) tends to infinity, classification results obtained
using all available features are superior to those obtained
using any transformation of the original space into a lower dimensional
space. However, when Gaussian pattern classes are present
and the ratio N_j / n (j = 1, ..., M) is small, the feature extraction
method presented in [1] can be of great value. When these
conditions are satisfied, one can sometimes obtain better classification
performance by using the optimal single linear Gaussian feature
than by using all n features. This is so because when the dimensionality
of the data is reduced to unity, the estimate of the mean
in the reduced space is a one-dimensional estimate rather than an
n-dimensional estimate. Similarly, the estimate of the class
conditional covariance is merely the one-dimensional variance estimate
rather than the n x n covariance matrix given by (2).
Essentially, then, the ratio N_j / m (where m is the dimension of
the space in which classification is made) is increased with the
reduction of dimensionality from n to m = 1. Hence the uncertainties
in the mean and covariance estimates are reduced. This
gain in accuracy in estimation may offset the loss of information
resulting from the dimensionality reduction. Thus, in certain cases,
classification using our optimal single linear
Gaussian feature can give rise to a lower probability of error than
classification using all available features. Numerical results from
the computer simulation described in the following section do indeed
show this.
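As a small illustration of the reduction to one dimension, the sketch below projects a handful of four-dimensional training vectors onto a direction w and estimates a single mean and a single variance in the reduced space. The vectors and the direction are made-up values chosen only for illustration (w is not the optimal feature of [1]), and a 1/N normalisation of the covariance is assumed. The one-dimensional estimates agree with w'(mean) and w'(covariance)w computed from the four-dimensional estimates, so only two quantities per class need to be estimated after projection.

```python
def mean_vector(samples):
    """Sample mean of a list of equal-length vectors."""
    count, dim = len(samples), len(samples[0])
    return [sum(x[k] for x in samples) / count for k in range(dim)]


def covariance_matrix(samples):
    """Sample covariance with 1/N normalisation (an assumption of this sketch)."""
    mu = mean_vector(samples)
    count, dim = len(samples), len(mu)
    return [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in samples) / count
             for b in range(dim)] for a in range(dim)]


def project(samples, w):
    """Map each vector x to the scalar feature y = w . x."""
    return [sum(wk * xk for wk, xk in zip(w, x)) for x in samples]


# Hypothetical training vectors and projection direction.
samples = [[1.0, 2.0, 0.0, 1.0], [2.0, 0.0, 1.0, 3.0], [0.0, 1.0, 2.0, 2.0],
           [3.0, 1.0, 1.0, 0.0], [1.0, 3.0, 2.0, 1.0]]
w = [0.5, -1.0, 0.25, 2.0]

# Estimates made directly in the one-dimensional feature space.
y = project(samples, w)
mean_1d = sum(y) / len(y)
var_1d = sum((v - mean_1d) ** 2 for v in y) / len(y)

# The same quantities obtained from the four-dimensional estimates.
mu = mean_vector(samples)
sigma = covariance_matrix(samples)
mean_from_4d = sum(wk * mk for wk, mk in zip(w, mu))
var_from_4d = sum(w[a] * sigma[a][b] * w[b] for a in range(4) for b in range(4))

print(mean_1d, mean_from_4d)
print(var_1d, var_from_4d)
```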


II.  Numerical  Results : 


To verify the preceding argument, the following test procedure
was conducted. A number of pseudo-random data vectors from two
four-dimensional Gaussian classes were generated. N of these
samples from each class were used to compose a training set from
which the class conditional statistics given by (1) and (2) were
obtained. Using these estimates, the optimal single linear Gaussian
feature was found. The remainder of the pseudo-random data
vectors were transformed using the optimal single linear Gaussian
feature and were classified in the reduced space. Classification of
these same samples was also performed in the untransformed space.
The classification performance, which is the ratio of the number of
samples classified properly to the total number of classifications
made, was computed for each method and is listed in Table I.
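The procedure just described can be sketched roughly as follows. This is not the authors' program. The class means and standard deviations are hypothetical, since the report does not list the ones used. Fisher's linear discriminant is substituted for the optimal single linear Gaussian feature of [1], whose algorithm is not reproduced here. A Gaussian maximum-likelihood classifier with equal priors and 1/N-normalised covariance estimates is assumed in both spaces. The numbers it prints are therefore not expected to match Table I.

```python
import math
import random


def mean_and_cov(samples):
    """Sample mean and 1/N-normalised covariance of a list of vectors."""
    n, d = len(samples), len(samples[0])
    mu = [sum(x[k] for x in samples) / n for k in range(d)]
    cov = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in samples) / n
            for b in range(d)] for a in range(d)]
    return mu, cov


def inverse_and_logdet(matrix):
    """Gauss-Jordan inversion with partial pivoting; also returns log|det|."""
    d = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(d)]
           for i, row in enumerate(matrix)]
    logdet = 0.0
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        logdet += math.log(abs(p))
        aug[col] = [v / p for v in aug[col]]
        for r in range(d):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[d:] for row in aug], logdet


def fit(samples):
    """Estimate one class model: mean, inverse covariance, log-determinant."""
    mu, cov = mean_and_cov(samples)
    inv_cov, logdet = inverse_and_logdet(cov)
    return mu, inv_cov, logdet


def gaussian_score(x, mu, inv_cov, logdet):
    """Gaussian log-likelihood of x, up to an additive constant."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    quad = sum(diff[a] * inv_cov[a][b] * diff[b]
               for a in range(len(diff)) for b in range(len(diff)))
    return -0.5 * (logdet + quad)


def accuracy(models, test_sets):
    """Fraction of test vectors assigned to their true class."""
    correct = total = 0
    for label, samples in enumerate(test_sets):
        for x in samples:
            scores = [gaussian_score(x, *m) for m in models]
            correct += scores.index(max(scores)) == label
            total += 1
    return correct / total


def fisher_direction(train_sets):
    """Fisher's discriminant direction (a stand-in for the feature of [1])."""
    (mu1, cov1), (mu2, cov2) = (mean_and_cov(s) for s in train_sets)
    d = len(mu1)
    pooled = [[cov1[a][b] + cov2[a][b] for b in range(d)] for a in range(d)]
    inv, _ = inverse_and_logdet(pooled)
    diff = [a - b for a, b in zip(mu1, mu2)]
    return [sum(inv[a][b] * diff[b] for b in range(d)) for a in range(d)]


def draw(rng, mean, std, count):
    """Draw vectors with independent Gaussian components."""
    return [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(count)]


def run_trial(n_train, n_test=200, seed=0):
    rng = random.Random(seed)
    # Hypothetical class parameters (mean, per-component standard deviation).
    params = [([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]),
              ([0.5, 0.3, 0.0, 0.0], [1.0, 1.5, 1.0, 2.0])]
    train = [draw(rng, m, s, n_train) for m, s in params]
    test = [draw(rng, m, s, n_test) for m, s in params]
    acc_full = accuracy([fit(s) for s in train], test)
    w = fisher_direction(train)
    proj = lambda sets: [[[sum(wk * xk for wk, xk in zip(w, x))] for x in s]
                         for s in sets]
    acc_1d = accuracy([fit(s) for s in proj(train)], proj(test))
    return acc_1d, acc_full


for n_train in (5, 10, 20, 30, 40, 50):
    acc_1d, acc_full = run_trial(n_train)
    print(n_train, round(acc_1d, 3), round(acc_full, 3))
```

A single seed gives a noisy picture; averaging `run_trial` over many seeds for each training set size would give a steadier comparison of the two spaces.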

These results clearly show that one can improve classification using
the optimal single linear Gaussian feature for small values of N / n.
At higher values of N / n, one may even obtain comparable
classification performance.


                      Classification Performance

    Number of            Optimal Single       All 4 Available
    Training Samples     Linear Gaussian      Features
    (N)                  Feature
    ---------------------------------------------------------
         5                   .590                 .485
        10                   .610                 .600
        20                   .610                 .600
        30                   .605                 .630
        40                   .590                 .630
        50                   .610                 .640

Table I  Classification Performance for
         Varying Sizes of Training Sets
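To read the table in terms of N / n, the following sketch recomputes, for each row, the ratio N / n with n = 4 and the difference between the two performance columns. The figures are copied from Table I; the variable names are illustrative.

```python
# Rows of Table I: (N, single-feature performance, all-4-features performance)
table_1 = [
    (5, 0.590, 0.485),
    (10, 0.610, 0.600),
    (20, 0.610, 0.600),
    (30, 0.605, 0.630),
    (40, 0.590, 0.630),
    (50, 0.610, 0.640),
]
n_dim = 4  # dimension of the original data space

gains = []
for n_train, acc_single, acc_all in table_1:
    gain = acc_single - acc_all  # positive when the single feature does better
    gains.append(gain)
    print(f"N={n_train:2d}  N/n={n_train / n_dim:5.2f}  gain={gain:+.3f}")

# Training set sizes at which the single feature outperforms all 4 features.
single_feature_wins = [n for n, s, a in table_1 if s > a]
print(single_feature_wins)
```

The single feature is ahead for N = 5, 10 and 20 (N / n up to 5), with gains of 0.105, 0.010 and 0.010, which corresponds to the 11% to 1% range quoted in the abstract.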
III.  Conclusions :

In the test case presented, it is readily noted that for low
values of N / n, the classification performance obtained using
the optimal single linear Gaussian feature exceeds that obtained
using all available features. Similar results were found in [3] by
classifying using subsets of all available features. Thus for
low values of N / n, one realizes certain advantages from using
this approach. First, a reduction in computer storage and mathematical
computation is achieved. More importantly, one may
improve the performance of the classifier.

The effect of having a small number of vectors in the training
set on other algorithms ought to be explored.